Extracting Recurrent Phrases and Terms from Texts Using a Purely Statistical Method

Authors

  • Zhao-Ming Gao
  • Harold L. Somers
Abstract

Most statistical measures for extracting interesting word pairs, such as mutual information (MI) and t-score, require a large corpus to work well. This paper evaluates some of the most widely used statistical measures and introduces a method that can identify significant bigrams in relatively small texts by adapting Fung and Church's (1994) K-vec algorithm, which was originally designed to extract word correspondences from unaligned parallel corpora. The proposed method captures the linguistic generalisation about lexical patterning in texts and can identify recurrent co-occurring word sequences, which might be phrases, terms, or unknown words. In addition, it has the potential to identify key phrases and terms that reveal topicality in a text.
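For readers unfamiliar with the two baseline measures named in the abstract, the following is a minimal sketch of how MI and t-score are commonly computed for adjacent word pairs. It is not the paper's proposed method, and it is not taken from the paper: the function name `bigram_scores`, the use of the token count as the normalising constant, and the simplified t-score formula (observed minus expected, divided by the square root of observed) are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bigram_scores(tokens):
    """Return {(w1, w2): (mi, t_score)} for every adjacent word pair.

    Hypothetical helper for illustration only; assumes the corpus size
    is approximated by the number of tokens.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        # Frequency expected if w1 and w2 occurred independently.
        expected = unigrams[w1] * unigrams[w2] / n
        # Pointwise mutual information in bits.
        mi = math.log2(observed / expected)
        # Common approximation of the t-score for collocations.
        t_score = (observed - expected) / math.sqrt(observed)
        scores[(w1, w2)] = (mi, t_score)
    return scores

if __name__ == "__main__":
    text = "machine translation is hard and machine translation is useful"
    for pair, (mi, t) in sorted(bigram_scores(text.split()).items()):
        print(pair, round(mi, 3), round(t, 3))
```

On a text this small, a pair that occurs only once can still receive a high MI value, which illustrates the sensitivity to corpus size that the abstract points out.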

Similar resources

Multimodal Comparable Corpora as Resources for Extracting Parallel Data: Parallel Phrases Extraction

Discovering parallel data in comparable corpora is a promising approach for overcoming the lack of parallel texts in statistical machine translation and other NLP applications. In this paper we propose an alternative to comparable corpora of texts as resources for extracting parallel data: a multimodal comparable corpus of audio and texts. We present a novel method to detect parallel phrases fr...

Comparing the Effect of Syntactic vs. Statistical Phrase Indexing Strategies for

In this paper we describe the results of experiments contrasting syntactic phrase indexing with statistical phrase indexing for Dutch texts. Our results showed that we at least need a compound splitting algorithm for good quality retrieval for Dutch texts. If we then add either syntactic or statistical phrases, performance generally improves, but this effect is never statistically significant. If...

IndexFinder: A Knowledge-based Method for Indexing Clinical Texts

Extracting key concepts from clinical texts for indexing is an important task in implementing a medical digital library. Several methods are proposed in the literature for mapping free text into terms controlled by the Unified Medical Language System (UMLS). They are, however, not appropriate for building a fast online application. MatMap and other methods use natural language processing (NLP) ...

Commonsense Causal Reasoning between Short Texts

Commonsense causal reasoning is the process of capturing and understanding the causal dependencies amongst events and actions. Such events and actions can be expressed in terms, phrases or sentences in natural language text. Therefore, one possible way of obtaining causal knowledge is by extracting causal relations between terms or phrases from a large text corpus. However, causal relations in ...

Corpus Based Method of Transforming Nominalized Phrases into Clauses for Text Mining Application

Nominalization is a linguistic phenomenon in which events usually described in terms of clauses are expressed in the form of noun phrases. Extracting event structures is an important task in text mining applications. To achieve this goal, clauses are parsed and the argument structure of main verbs is extracted from the parsed results. This kind of preprocessing has been commonly done in the pa...

Publication date: 1998